RED WINE QUALITY by SAMMIT RANADE

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

The summary above gives the minimum, quartiles, median, mean, and maximum of each variable, which shows how widely each one varies.

The dataset I am exploring is the Red Wine Quality dataset. The guiding question is “Which chemical properties influence the quality of red wines?” This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine, plus a row index (X) and the quality rating. At least 3 wine experts rated the quality of each wine on a scale from 0 (very bad) to 10 (excellent); in this data the ratings range from 3 to 8. Because the data set is already tidy, no cleaning was needed. It records important characteristics of red wines such as pH, density, and alcohol percentage.
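The claim that the data needs no cleaning can be checked with a couple of quick counts. The sketch below is only an illustration: the helper name `check_tidy` is hypothetical, and the four rows are made-up values (with one missing pH added on purpose), because the report does not say where the CSV file lives.

```r
# Toy rows with made-up values; the real data would be read from a CSV file
# whose name and location are not given in this report.
wine <- data.frame(
  X       = 1:4,
  pH      = c(3.51, 3.20, NA, 3.16),
  alcohol = c(9.4, 9.8, 9.8, 9.8),
  quality = c(5, 5, 5, 6)
)

# Count rows, missing values per column, and fully duplicated rows.
# The index column is ignored when looking for duplicates.
check_tidy <- function(df, index_col = "X") {
  body <- df[, setdiff(names(df), index_col), drop = FALSE]
  list(
    n_rows       = nrow(df),
    missing      = colSums(is.na(body)),
    n_duplicated = sum(duplicated(body))
  )
}

report <- check_tidy(wine)
print(report)
```

On the real data frame, all-zero `missing` counts would support skipping the cleaning step.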

Univariate Plots Section

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Reason: To study each variable in the dataset on its own (univariate analysis). Reflection: ‘pH’ is plotted on the X-axis. In the plots that follow, each of the remaining variables is plotted on the X-axis in turn.

Each of the following plots is shown with the outliers eliminated.
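The report does not say how the outliers were eliminated. One common approach, sketched here purely as an assumption, is to keep only the values between two quantiles before drawing the histogram. The function name `trim_outliers`, the default cut-offs, and the sample values are all illustrative.

```r
# Keep only values between the lower and upper quantiles of x.
# The default 1%/99% cut-offs are an assumption, not taken from the report.
trim_outliers <- function(x, lower = 0.01, upper = 0.99) {
  limits <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  x[!is.na(x) & x >= limits[[1]] & x <= limits[[2]]]
}

# Made-up values with one obvious outlier (15.5).
residual_sugar <- c(1.9, 2.0, 2.1, 2.2, 2.2, 2.3, 2.4, 2.5, 2.6, 15.5)

# With only ten values, a 90% upper cut-off is used so the outlier is dropped.
trimmed <- trim_outliers(residual_sugar, lower = 0, upper = 0.9)
print(trimmed)

# hist(trimmed) would then draw the histogram without the extreme value.
```

Trimming changes only what is plotted; the summary statistics at the top of the report are computed on the full data.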

Reason: To get a clearer view of the histogram. Reflection: The X-axis has been limited to pH 3.0 to 3.75 so that the histogram is easier to read.

Reason: To check whether logarithmic or square-root transformations change the shape of the distribution. Reflection: The histograms do not differ much, so there is no need to transform the data. Box plots have not been used because no two features are being compared at this stage.
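Judging by eye whether a transformation changes a histogram can be backed up with a number. The sketch below uses sample skewness for that purpose. This is a swapped-in measure and not something the report computed, and the function names and the three sample values are made up for illustration.

```r
# Sample skewness (third standardized moment); 0 means a symmetric sample.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Skewness of the raw values and of their log10 and square-root transforms.
compare_transforms <- function(x) {
  c(raw   = skewness(x),
    log10 = skewness(log10(x)),
    sqrt  = skewness(sqrt(x)))
}

x <- c(1, 10, 100)  # made-up, strongly right-skewed values
result <- compare_transforms(x)
print(round(result, 3))
```

If the three numbers are close for a real variable, that agrees with the conclusion that transforming it is not worthwhile.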

Univariate Analysis

What is the structure of your dataset?

The dataset is a data frame with 1,599 observations of 13 variables: a row index (X), 11 chemical properties, and the quality rating.

What is/are the main feature(s) of interest in your dataset?

The main features of interest in this dataset are ‘density’, ‘pH’, ‘alcohol’, and ‘quality’.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Other features such as ‘fixed.acidity’, ‘residual.sugar’, and ‘citric.acid’ will help support the investigation into the features of interest.

Did you create any new variables from existing variables in the dataset?

There was no need to create any new variables from the existing variables in the dataset.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the
form of the data? If so, why did you do this?

No, there were no unusual distributions among the features investigated. There was also no need for data munging or tidying, as this dataset is already structured, tidy, and easy to work with.

Bivariate Plots Section

Based on what you saw in the univariate plots, what relationships between variables might be interesting to look at in this section? Don’t limit yourself to relationships between a main output feature and one of the supporting variables. Try to look at relationships between supporting variables as well.

Reason: The plot below is easier to study than the one above. Reflection: Density is plotted against pH in both plots.

Reason: To check whether there is any relationship between these two. Reflection: Fixed acidity is plotted against volatile acidity, and against the square root of volatile acidity.

Reason: To see how free sulfur dioxide and total sulfur dioxide are related, and how strongly one depends on the other. Reflection: A scatterplot of free sulfur dioxide against total sulfur dioxide is shown.

This plot shows the mean of total.sulfur.dioxide against sulphates. I chose it because it shows how total.sulfur.dioxide varies with sulphates. I also found the variation interesting, even though these two are not the main features of interest.

## # A tibble: 6 x 4
##   alcohol mean_quality median_quality     n
##     <dbl>        <dbl>          <dbl> <int>
## 1    8.40         4.50           4.50     2
## 2    8.50         5.00           5.00     1
## 3    8.70         6.00           6.00     2
## 4    8.80         5.00           5.00     2
## 5    9.00         5.40           6.00    30
## 6    9.05         4.00           4.00     1

Reason: To see how quality varies with alcohol. Reflection: There is a dip in quality for wines containing about 10% alcohol, which is interesting.
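A per-alcohol summary like the table above can be built by grouping on alcohol and summarising quality. The sketch below uses base R's `aggregate()`, which may differ from the code that produced the tibble. The function name `quality_by_alcohol` is hypothetical, and the five rows are the lowest-alcohol wines as they can be read off the alcohol and density table in the multivariate section.

```r
# Five individual wines, read off the alcohol/density table later in this report.
wine <- data.frame(
  alcohol = c(8.4, 8.4, 8.5, 8.7, 8.7),
  quality = c(3, 6, 5, 6, 6)
)

# Mean quality, median quality, and number of wines for each alcohol value.
quality_by_alcohol <- function(df) {
  out <- aggregate(
    quality ~ alcohol, data = df,
    FUN = function(q) c(mean = mean(q), median = median(q), n = length(q))
  )
  # aggregate() returns the three statistics as one matrix column;
  # flatten it into ordinary columns.
  out <- data.frame(alcohol = out$alcohol, out$quality)
  names(out) <- c("alcohol", "mean_quality", "median_quality", "n")
  out
}

result <- quality_by_alcohol(wine)
print(result)
```

For these five wines the groups for 8.4, 8.5, and 8.7 come out with mean qualities of 4.5, 5, and 6, in line with the corresponding rows of the table.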

## 
##  Pearson's product-moment correlation
## 
## data:  pH and quality
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139
## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
## 
##  Pearson's product-moment correlation
## 
## data:  density and quality
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192

Reason: To find the strongest relationship between two characteristics. Reflection: A correlation test is performed to check how strong the relationship is between each characteristic and quality. The correlation between alcohol and quality is the highest (0.476), followed in magnitude by density (-0.175) and pH (-0.058).
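The numbers in the test output above are tied together by standard formulas: the t statistic is r * sqrt(n - 2) / sqrt(1 - r^2), and the confidence interval comes from Fisher's z transformation. The sketch below recomputes them from the reported correlations and n = 1599. The function name `cor_test_from_r` is hypothetical.

```r
# Recompute the Pearson test statistics from a correlation r and sample size n.
cor_test_from_r <- function(r, n, conf_level = 0.95) {
  df <- n - 2
  t_stat <- r * sqrt(df) / sqrt(1 - r^2)
  p_value <- 2 * pt(-abs(t_stat), df)        # two-sided p-value
  # Fisher z interval: atanh(r) is approximately normal with sd 1/sqrt(n - 3).
  z <- atanh(r)
  half_width <- qnorm(1 - (1 - conf_level) / 2) / sqrt(n - 3)
  list(
    t        = t_stat,
    df       = df,
    p_value  = p_value,
    conf_int = tanh(c(z - half_width, z + half_width))
  )
}

# Correlations as reported in the output above.
alcohol <- cor_test_from_r(0.4761663, 1599)
ph      <- cor_test_from_r(-0.05773139, 1599)

print(alcohol)
print(ph)
```

The alcohol result comes out at about t = 21.64 with an interval of roughly 0.437 to 0.513. The pH result is about t = -2.31 with p close to 0.021. Both agree with the output above.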

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features
in the dataset?

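The relationship between alcohol and quality is the strongest one seen: of the features tested, the quality of red wines is most closely associated with their alcohol percentage (r = 0.476). Density comes next (r = -0.175), then acidity as measured by pH, which shows only a very weak relationship with quality (r = -0.058).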

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

(1) free.sulfur.dioxide and total.sulfur.dioxide (2) fixed.acidity and volatile.acidity. These two relationships are interesting even though they do not involve the main features of interest.

What was the strongest relationship you found?

The relationship between alcohol and quality is the strongest relationship seen among the main features of interest, because its Pearson’s correlation coefficient is the highest (0.476).

Multivariate Plots Section

## # A tibble: 6 x 5
##   alcohol density mean_quality median_quality     n
##     <dbl>   <dbl>        <dbl>          <dbl> <int>
## 1    8.40   0.999         3.00           3.00     1
## 2    8.40   1.00          6.00           6.00     1
## 3    8.50   0.999         5.00           5.00     1
## 4    8.70   0.997         6.00           6.00     1
## 5    8.70   0.999         6.00           6.00     1
## 6    8.80   1.00          5.00           5.00     2

Reason: To study multivariate plots, a new data frame is introduced. Reflection: A data frame is made containing the quality together with alcohol and density for easy comparison.
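A data frame like the one above can be obtained by grouping on alcohol and density together. This is a minimal base R sketch and may differ from the code that produced the tibble. The function name `quality_by_alcohol_density` is hypothetical, and the input rows are the first five wines of the table, with density as printed (rounded).

```r
# First five wines from the table above; density values are rounded as printed.
wine <- data.frame(
  alcohol = c(8.4, 8.4, 8.5, 8.7, 8.7),
  density = c(0.999, 1.00, 0.999, 0.997, 0.999),
  quality = c(3, 6, 5, 6, 6)
)

# Mean quality, median quality, and count for each (alcohol, density) pair.
quality_by_alcohol_density <- function(df) {
  out <- aggregate(
    quality ~ alcohol + density, data = df,
    FUN = function(q) c(mean(q), median(q), length(q))
  )
  # Flatten the matrix column returned by aggregate() into ordinary columns.
  out <- data.frame(out[c("alcohol", "density")], out$quality)
  names(out) <- c("alcohol", "density", "mean_quality", "median_quality", "n")
  # Sort by alcohol, then density, to match the layout of the table.
  out <- out[order(out$alcohol, out$density), ]
  rownames(out) <- NULL
  out
}

result <- quality_by_alcohol_density(wine)
print(result)
```

Most pairs in the table contain a single wine, so grouping on two continuous variables summarises very little. Binning them first, for example with `cut()`, would give larger groups, though the report does not do this.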

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The relationships were not strengthened much compared to the bivariate analysis. Wines containing around 8%-12% alcohol with a density of 0.995-1.000 g/cc are the better wines!

Were there any interesting or surprising interactions between features?

To my surprise, the multivariate plots did not show many interesting relationships compared to the bivariate plots.


Final Plots and Summary

Plot One

Description One

This plot shows that wines that are acidic, with a pH of around 3.2-3.4, and have a density of 0.995-1.000 g/cc are the better wines. I chose this plot because it gives a good idea of the characteristics of the wines.

Plot Two

Description Two

The red line is the mean line for total.sulfur.dioxide: the plot shows the mean of total.sulfur.dioxide against sulphates. I chose this plot because it shows how total.sulfur.dioxide varies with sulphates. I also found the variation interesting, even though these two are not the main features of interest.

Plot Three

Description Three

The scatterplot above shows that red wines with a density of 0.995-1.000 g/cc containing 8%-12% alcohol are better. I chose this plot because it gives a clearer idea of these characteristics.


Reflection

The dataset I chose to explore is Red Wine Quality. I chose this dataset purely out of interest :) ! It was already structured, so no restructuring or data munging was necessary. I found that wines with 8%-12% alcohol, a density of 0.995-1.000 g/cc, and a pH of 3.2-3.4 are the better wines. While exploring, my first hurdle was choosing the right characteristics to work on. Once I started plotting, the data became easier to work with as I went on. The bivariate plots were more helpful than the multivariate plots. I also found an interesting relationship between total.sulfur.dioxide and sulphates in the wines. In the future, more wines could be added to the present dataset, and the same analysis could be repeated on the extended dataset to check whether the current findings change over time and, if so, why.
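If the analysis is repeated on an extended dataset, the finding about the alcohol, density, and pH ranges can be re-checked by comparing mean quality inside and outside those ranges. The sketch below is one possible way to do that. The function name `mean_quality_by_range` is hypothetical, and the five rows are made-up values that say nothing about the real wines.

```r
# Compare mean quality for wines inside vs. outside the stated ranges.
# Default ranges are the ones quoted in the reflection above.
mean_quality_by_range <- function(df,
                                  alcohol = c(8, 12),
                                  density = c(0.995, 1.000),
                                  ph      = c(3.2, 3.4)) {
  inside <- df$alcohol >= alcohol[1] & df$alcohol <= alcohol[2] &
            df$density >= density[1] & df$density <= density[2] &
            df$pH      >= ph[1]      & df$pH      <= ph[2]
  c(inside    = mean(df$quality[inside]),
    outside   = mean(df$quality[!inside]),
    n_inside  = sum(inside),
    n_outside = sum(!inside))
}

# Made-up rows for illustration only.
wine <- data.frame(
  alcohol = c(9.5, 10.2, 11.0, 12.5, 13.0),
  density = c(0.9968, 0.9978, 0.9956, 0.9940, 0.9930),
  pH      = c(3.3, 3.2, 3.4, 3.5, 3.1),
  quality = c(5, 6, 6, 7, 5)
)

result <- mean_quality_by_range(wine)
print(result)
```

Reporting the group sizes next to the means matters here, because a range that covers nearly every wine cannot say much about which wines are better.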